`row_ids` only needs to hold the BN rows for the current tile. This reduces the shared memory usage and also the need for batch splitting.

I didn't expect this to have such a positive impact on performance. I'm not sure whether it's due to short-circuiting the row_id search, allowing more workgroups to run concurrently, or just reducing shared memory traffic. I don't think we were hitting the batch splitting path with pp512 for any of these models.
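To make the change concrete, here is a minimal GLSL sketch of the idea, sizing the `row_ids` scratch to the tile rather than the batch. This is not the actual mul_mm.comp code; BN, the buffer layout, and the loading loop are illustrative assumptions only.

```glsl
#version 450

// Hypothetical sketch, not the real shader: the row_ids scratch covers only
// the BN rows of the tile this workgroup computes, so its shared-memory
// footprint stays constant regardless of the batch size.

#define BN 64  // rows per workgroup tile (illustrative value)

layout(local_size_x = 128) in;

// Routing data: which source row each tile row should read (assumed layout).
layout(binding = 0) readonly buffer RowMap { uint src_row[]; };
layout(binding = 1) writeonly buffer Dst   { uint gathered[]; };

// Before: shared uint row_ids[MAX_BATCH];  // one slot per batch row
// After: one slot per row of the current tile.
shared uint row_ids[BN];

void main() {
    const uint tile_row0 = gl_WorkGroupID.y * BN;

    // Cooperatively load only this tile's row ids into shared memory.
    for (uint r = gl_LocalInvocationID.x; r < BN; r += gl_WorkGroupSize.x) {
        row_ids[r] = src_row[tile_row0 + r];
    }
    barrier();

    // A real matmul kernel would use row_ids[] to gather B rows here; this
    // sketch just writes them out so the shader is self-contained.
    if (gl_LocalInvocationID.x < BN) {
        gathered[tile_row0 + gl_LocalInvocationID.x] = row_ids[gl_LocalInvocationID.x];
    }
}
```

The smaller per-workgroup shared allocation is also what could let more workgroups run concurrently per compute unit, which is one of the possible explanations for the speedup mentioned above.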
Wow, I'm getting a huge 50% improvement on my W8100 + RX 470.
PR:

| model | size | params | backend | ngl | threads | main_gpu | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 100 | 8 | 1 | pp512 | 235.11 ± 0.35 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 100 | 8 | 1 | tg128 | 35.62 ± 0.29 |
Master:

| model | size | params | backend | ngl | threads | main_gpu | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 100 | 8 | 1 | pp512 | 153.46 ± 0.13 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 100 | 8 | 1 | tg128 | 35.55 ± 0.21 |
> or just reducing shared memory traffic
I remember trying the iq4_nl LUT with subgroup shuffles instead of shared memory a while back. It didn't make a difference for mat vec, but I got what I think was a 10-20% improvement for mat mul. Considering how often mat mul accesses shared memory, shared memory traffic was likely the reason for that.
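For reference, a minimal GLSL sketch of that subgroup-shuffle LUT technique is below. This is not the llama.cpp shader: the buffer bindings and the 4-bit packing are assumed for illustration, and it requires a subgroup size of at least 16.

```glsl
#version 450
#extension GL_KHR_shader_subgroup_basic : require
#extension GL_KHR_shader_subgroup_shuffle : require

// Sketch only: keep the 16-entry iq4_nl codebook in registers, one entry per
// lane, and fetch entries with subgroupShuffle instead of shared-memory loads.

layout(local_size_x = 64) in;

layout(binding = 0) readonly buffer Quants { uint q[]; };   // packed 4-bit indices (assumed packing)
layout(binding = 1) writeonly buffer Vals  { float v[]; };

// iq4_nl non-linear codebook (values as in ggml's kvalues_iq4nl).
const float kvalues_iq4nl[16] = float[16](
    -127.0, -104.0, -83.0, -65.0, -49.0, -35.0, -22.0, -10.0,
       1.0,   13.0,  25.0,  38.0,  53.0,  69.0,  89.0, 113.0);

void main() {
    const uint i = gl_GlobalInvocationID.x;

    // Lanes 0..15 each hold one codebook entry in a register.
    const float lut_val = kvalues_iq4nl[gl_SubgroupInvocationID & 15u];

    // Unpack one 4-bit index per invocation (8 nibbles per uint, assumed).
    const uint idx = (q[i / 8u] >> ((i % 8u) * 4u)) & 0xFu;

    // Broadcast the wanted entry from the lane that owns it instead of
    // reading it from a shared-memory table.
    v[i] = subgroupShuffle(lut_val, idx);
}
```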